NYPD Shooting Incident Data Analysis Report

Hello out there! The goal of this project is to clean, analyze, and model the data from the NYPD Shooting Incident Data Report, which contains all occurrences from 2006 to 2023.

Data Source: https://catalog.data.gov/dataset/nypd-shooting-incident-data-historic

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
nypd_raw <- read_csv("https://data.cityofnewyork.us/api/views/833y-fsy8/rows.csv")
## Rows: 28562 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (12): OCCUR_DATE, BORO, LOC_OF_OCCUR_DESC, LOC_CLASSFCTN_DESC, LOCATION...
## dbl   (7): INCIDENT_KEY, PRECINCT, JURISDICTION_CODE, X_COORD_CD, Y_COORD_CD...
## lgl   (1): STATISTICAL_MURDER_FLAG
## time  (1): OCCUR_TIME
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(nypd_raw)
summary(nypd_raw)
##   INCIDENT_KEY        OCCUR_DATE         OCCUR_TIME           BORO          
##  Min.   :  9953245   Length:28562       Length:28562      Length:28562      
##  1st Qu.: 65439914   Class :character   Class1:hms        Class :character  
##  Median : 92711254   Mode  :character   Class2:difftime   Mode  :character  
##  Mean   :127405824                      Mode  :numeric                      
##  3rd Qu.:203131993                                                          
##  Max.   :279758069                                                          
##                                                                             
##  LOC_OF_OCCUR_DESC     PRECINCT     JURISDICTION_CODE LOC_CLASSFCTN_DESC
##  Length:28562       Min.   :  1.0   Min.   :0.0000    Length:28562      
##  Class :character   1st Qu.: 44.0   1st Qu.:0.0000    Class :character  
##  Mode  :character   Median : 67.0   Median :0.0000    Mode  :character  
##                     Mean   : 65.5   Mean   :0.3219                      
##                     3rd Qu.: 81.0   3rd Qu.:0.0000                      
##                     Max.   :123.0   Max.   :2.0000                      
##                                     NA's   :2                           
##  LOCATION_DESC      STATISTICAL_MURDER_FLAG PERP_AGE_GROUP    
##  Length:28562       Mode :logical           Length:28562      
##  Class :character   FALSE:23036             Class :character  
##  Mode  :character   TRUE :5526              Mode  :character  
##                                                               
##                                                               
##                                                               
##                                                               
##    PERP_SEX          PERP_RACE         VIC_AGE_GROUP        VIC_SEX         
##  Length:28562       Length:28562       Length:28562       Length:28562      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##    VIC_RACE           X_COORD_CD        Y_COORD_CD        Latitude    
##  Length:28562       Min.   : 914928   Min.   :125757   Min.   :40.51  
##  Class :character   1st Qu.:1000068   1st Qu.:182912   1st Qu.:40.67  
##  Mode  :character   Median :1007772   Median :194901   Median :40.70  
##                     Mean   :1009424   Mean   :208380   Mean   :40.74  
##                     3rd Qu.:1016807   3rd Qu.:239814   3rd Qu.:40.82  
##                     Max.   :1066815   Max.   :271128   Max.   :40.91  
##                                                        NA's   :59     
##    Longitude        Lon_Lat         
##  Min.   :-74.25   Length:28562      
##  1st Qu.:-73.94   Class :character  
##  Median :-73.92   Mode  :character  
##  Mean   :-73.91                     
##  3rd Qu.:-73.88                     
##  Max.   :-73.70                     
##  NA's   :59
colSums(is.na(nypd_raw))
##            INCIDENT_KEY              OCCUR_DATE              OCCUR_TIME 
##                       0                       0                       0 
##                    BORO       LOC_OF_OCCUR_DESC                PRECINCT 
##                       0                   25596                       0 
##       JURISDICTION_CODE      LOC_CLASSFCTN_DESC           LOCATION_DESC 
##                       2                   25596                   14977 
## STATISTICAL_MURDER_FLAG          PERP_AGE_GROUP                PERP_SEX 
##                       0                    9344                    9310 
##               PERP_RACE           VIC_AGE_GROUP                 VIC_SEX 
##                    9310                       0                       0 
##                VIC_RACE              X_COORD_CD              Y_COORD_CD 
##                       0                       0                       0 
##                Latitude               Longitude                 Lon_Lat 
##                      59                      59                      59
nypd = subset(nypd_raw, select=-c(X_COORD_CD, Y_COORD_CD, Lon_Lat))
nypd
library(dplyr)

nypd <- distinct(nypd)
nypd <- nypd %>% replace(.== "NULL", "UNKNOWN")

nypd <- nypd %>% replace(.== "(null)", "UNKNOWN")

# nypd <- nypd %>% replace(.== "UNKNOWN", NA)
unique(nypd$JURISDICTION_CODE)
## [1]  0  2  1 NA
nypd <- nypd %>% drop_na(JURISDICTION_CODE)
nypd <- nypd %>% drop_na(Latitude)
nypd <- nypd %>% drop_na(OCCUR_DATE)

nypd <- nypd %>% drop_na(OCCUR_TIME)
nypd[is.na(nypd)] <- "UNKNOWN"

nypd
all_cols <- lapply(nypd, unique)

unique_cols <- lengths(all_cols)

unique_cols
##            INCIDENT_KEY              OCCUR_DATE              OCCUR_TIME 
##                   22349                    6093                    1423 
##                    BORO       LOC_OF_OCCUR_DESC                PRECINCT 
##                       5                       3                      77 
##       JURISDICTION_CODE      LOC_CLASSFCTN_DESC           LOCATION_DESC 
##                       3                      10                      40 
## STATISTICAL_MURDER_FLAG          PERP_AGE_GROUP                PERP_SEX 
##                       2                       9                       4 
##               PERP_RACE           VIC_AGE_GROUP                 VIC_SEX 
##                       7                       7                       3 
##                VIC_RACE                Latitude               Longitude 
##                       7                   13385                   13373

Data Visualizations

Here we are performing some more data cleaning and using those cleaned variables to create some visualizations that tell a story about the data overall.

unique(nypd$VIC_AGE_GROUP)
## [1] "25-44"   "18-24"   "45-64"   "65+"     "<18"     "UNKNOWN" "1022"
nypd <- nypd[!(nypd$VIC_AGE_GROUP %in% "1022"), ]
library(ggplot2)

ggplot(data=nypd, aes(x=VIC_AGE_GROUP)) + 
  geom_bar(fill = 'lightblue') + 
  ggtitle("Victim Age Groups") + 
  xlab("Age Group") + 
  ylab("Total") + theme_minimal()

nypd <- nypd[!(nypd$PERP_AGE_GROUP %in% c("940", "224", "1020")), ]

nypd$PERP_SEX[nypd$PERP_SEX == "U"] <- "UNKNOWN"

ggplot(data=nypd, aes(x=PERP_AGE_GROUP)) + 
  geom_bar(fill = 'purple') + 
  ggtitle("Perpetrator Age Groups") + 
  xlab("Age Group") + 
  ylab("Total") + theme_minimal()

It looks like the majority of perpetrators do not belong to a specific age group. But the rest are similar in age to their victims, 18-24 and 25-44.

ggplot(nypd, aes(x=Latitude, y=Longitude)) + 
  geom_point(aes(color=factor(BORO))) + 
  ggtitle("Location of Occurrence by Borough") + 
  labs("Borough") + theme_minimal()

ggplot(data=nypd, aes(x=BORO)) + 
  geom_bar(fill = 'pink') + 
  ggtitle("Occurrences by Borough") + 
  xlab("Borough") + 
  ylab("Total") + theme_minimal()

The Bronx and Brooklyn have more cases than the rest of the boroughs.

library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
tabyl(nypd, BORO, VIC_AGE_GROUP)
ggplot(data=nypd, aes(x=STATISTICAL_MURDER_FLAG)) + geom_bar(fill = 'blue') +   
  ggtitle("Statistical Murder Flags") + 
  xlab("Murder Flagged") + 
  ylab("Total") + theme_minimal()

Here we can see that there is a large imbalance between occurrences flagged as murder and those that are not flagged. The majority of events are not flagged.

tabyl(nypd, PERP_AGE_GROUP, PERP_RACE, PERP_SEX)
## $F
##  PERP_AGE_GROUP AMERICAN INDIAN/ALASKAN NATIVE ASIAN / PACIFIC ISLANDER BLACK
##             <18                              0                        0    26
##           18-24                              1                        1    92
##           25-44                              0                        4   144
##           45-64                              0                        0    20
##             65+                              0                        0     0
##         UNKNOWN                              0                        0    10
##  BLACK HISPANIC UNKNOWN WHITE WHITE HISPANIC
##               2       0     1             10
##              17       1     5             43
##               7       0    12             32
##               1       0     1              4
##               1       0     0              0
##               2       4     0              2
## 
## $M
##  PERP_AGE_GROUP AMERICAN INDIAN/ALASKAN NATIVE ASIAN / PACIFIC ISLANDER BLACK
##             <18                              0                       14  1215
##           18-24                              0                       48  4479
##           25-44                              1                       81  4186
##           45-64                              0                       10   427
##             65+                              0                        0    30
##         UNKNOWN                              0                       11  1231
##  BLACK HISPANIC UNKNOWN WHITE WHITE HISPANIC
##             144      14     9            235
##             606      56    47           1012
##             464      56   136            902
##              47       5    58            124
##               4       1    18             11
##              91     221    11            125
## 
## $UNKNOWN
##  PERP_AGE_GROUP AMERICAN INDIAN/ALASKAN NATIVE ASIAN / PACIFIC ISLANDER BLACK
##             <18                              0                        0     1
##           18-24                              0                        0     7
##           25-44                              0                        0     3
##           45-64                              0                        0     0
##             65+                              0                        0     0
##         UNKNOWN                              0                        0     7
##  BLACK HISPANIC UNKNOWN WHITE WHITE HISPANIC
##               0       2     0              0
##               2       7     0              0
##               0       4     0              0
##               0       0     0              0
##               0       0     0              0
##               0   11891     0              0
# library(vtree)

# vtree(nypd, c("PERP_AGE_GROUP", "PERP_SEX", "PERP_RACE"), horiz = TRUE, fillcolor = c(PERP_AGE_GROUP = "#e7d4e8", PERP_SEX = "#99d8c9", PERP_RACE = "#9ecae1"), keep = list(PERP_SEX = c("M", "F", "UNKNOWN")), showcount = FALSE)
# vtree(nypd, c("BORO", "LOC_OF_OCCUR_DESC"), fillcolor = c(BORO="#e7d4e8", LOC_OF_OCCUR_DESC="#99d8c9"), horiz=TRUE)
# nypd[["OCCUR_DATE"]] <- as.POSIXct(nypd[["OCCUR_DATE"]], format ="%m-%d-%Y")
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
library(CGPfunctions)

ggplot(data=nypd, aes(x=PRECINCT)) + 
  geom_bar(fill = 'brown') +   
  ggtitle("Occurrences by Precinct") + 
  xlab("Precinct Number") + 
  ylab("Total") + theme_minimal()

# PlotXTabs(nypd, LOC_OF_OCCUR_DESC, BORO) 
# PlotXTabs(nypd, PERP_SEX, PERP_AGE_GROUP)
# PlotXTabs(nypd, VIC_SEX, VIC_AGE_GROUP)
library(lubridate)

nypd_date <- nypd %>% mutate(date=mdy(OCCUR_DATE))
nypd_date

Time Series Visualizations

nypd_total <- sqldf('SELECT date, COUNT(date) as total FROM nypd_date GROUP BY date ORDER BY date')

nypd_total
nypd_cumsum_total <- nypd_total %>% mutate(cum_total=cumsum(total))

nypd_cumsum_total <- subset(nypd_cumsum_total, select=-c(total))

nypd_cumsum_total

The two graphs below are interactive and show that cases have steadily increased (not too fast!) over the years but there is a slight dip in June 2020 followed by an increase in July 2020.

library(TSstudio)

ts_plot(nypd_cumsum_total, 
        title = "NYPD Historic Shooting Data Cumulative Totals 2006-2023", 
        Xtitle = "Year", 
        Ytitle = "Cumulative Total Events", 
        slider = TRUE)
ts_plot(nypd_total, 
        title = "NYPD Historic Shooting Data Daily Totals", 
        Xtitle = "Date", 
        Ytitle = "Total", 
        slider = TRUE)

Data Modeling

Logistic Regression

After converting our feature and target variables using one-hot encoding methods, we’ll be splitting the data 80-20% for training and testing.

If the logistic regression model gives an installation error, try install.packages(‘glmnet’, dependencies=TRUE, type=“binary”).

library(datasets) 
library(caTools)
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
## 
## Attaching package: 'strucchange'
## The following object is masked from 'package:stringr':
## 
##     boundary
## 
## Attaching package: 'party'
## The following object is masked from 'package:dplyr':
## 
##     where
library(magrittr)
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom        1.0.6     ✔ rsample      1.2.1
## ✔ dials        1.3.0     ✔ tune         1.2.1
## ✔ infer        1.0.7     ✔ workflows    1.1.4
## ✔ modeldata    1.4.0     ✔ workflowsets 1.1.0
## ✔ parsnip      1.2.1     ✔ yardstick    1.3.1
## ✔ recipes      1.1.0
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard()     masks purrr::discard()
## ✖ magrittr::extract()   masks tidyr::extract()
## ✖ dplyr::filter()       masks stats::filter()
## ✖ parsnip::fit()        masks infer::fit(), party::fit(), modeltools::fit()
## ✖ recipes::fixed()      masks stringr::fixed()
## ✖ dplyr::lag()          masks stats::lag()
## ✖ tune::parameters()    masks dials::parameters(), modeltools::parameters()
## ✖ magrittr::set_names() masks purrr::set_names()
## ✖ yardstick::spec()     masks readr::spec()
## ✖ recipes::step()       masks stats::step()
## ✖ recipes::update()     masks stats4::update(), stats::update()
## ✖ party::where()        masks dplyr::where()
## • Dig deeper into tidy modeling with R at https://www.tmwr.org
library(readr)
nypd_final <- subset(nypd_date, 
                     select=-c(INCIDENT_KEY, 
                               OCCUR_DATE, 
                               Longitude, 
                               Latitude, 
                               OCCUR_TIME, 
                               date))

nypd_final
tar_var <- nypd_final[["STATISTICAL_MURDER_FLAG"]]

nypd_final <- subset(nypd_final, select=-c(STATISTICAL_MURDER_FLAG))

nypd_final
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following objects are masked from 'package:yardstick':
## 
##     precision, recall, sensitivity, specificity
## The following object is masked from 'package:purrr':
## 
##     lift
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:zoo':
## 
##     yearmon, yearqtr
## The following objects are masked from 'package:lubridate':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following object is masked from 'package:purrr':
## 
##     transpose
dummy <- dummyVars(" ~ .", data=nypd_final)

nypd_new <- data.frame(predict(dummy, newdata = nypd_final))
nypd_new$STATISTICAL_MURDER_FLAG = c(tar_var)
nypd_new$STATISTICAL_MURDER_FLAG <- as.factor(as.logical(nypd_new$STATISTICAL_MURDER_FLAG))
nypd_new
set.seed(42)

nypd_split <- sample.split(nypd_new, SplitRatio = 0.8)

train_data <- subset(nypd_new, nypd_split == TRUE)
test_data <- subset(nypd_new, nypd_split == FALSE)
rmodel <- logistic_reg(mixture = double(1), penalty = double(1)) %>% set_engine("glmnet") %>% set_mode("classification") %>% fit(STATISTICAL_MURDER_FLAG ~ ., data = train_data)

tidy(rmodel)
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Loaded glmnet 4.1-8
pred_class <- predict(rmodel, new_data = test_data, type = 'class')

pred_proba <- predict(rmodel, new_data = test_data, type = 'prob')

results <- test_data %>% select(STATISTICAL_MURDER_FLAG) %>% bind_cols(pred_class, pred_proba)
accuracy(results, truth = STATISTICAL_MURDER_FLAG, estimate = .pred_class)
conf_mat(results, truth = STATISTICAL_MURDER_FLAG, estimate = .pred_class)
##           Truth
## Prediction FALSE TRUE
##      FALSE  4717 1074
##      TRUE     17   19
library(Metrics)
## 
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
## 
##     precision, recall
## The following objects are masked from 'package:yardstick':
## 
##     accuracy, mae, mape, mase, precision, recall, rmse, smape
results$gtruth <- as.integer(as.logical(results$STATISTICAL_MURDER_FLAG))
results$pred_class <- as.integer(as.logical(results$.pred_class))

precision(results$gtruth, results$pred_class)
## [1] 0.5277778
accuracy(results$gtruth, results$pred_class)
## [1] 0.8127681
recall(results$gtruth, results$pred_class)
## [1] 0.01738335
auc(results$gtruth, results$pred_class)
## [1] 0.5068962
rmse(results$gtruth, results$pred_class)
## [1] 0.432703

A Note on Biases

When it comes to data cleaning and manipulation, regardless of the topic, biases always find a way to influence our decisions when it comes to how or why we perform certain procedures. When I was looking at this data set, I noticed that the majority of shooting events reported were from people ages 18-44 with the reminder being under 18 and 45+. This leaves teenagers under 18 and the elderly susceptible to bias when building the model since there isn’t a lot of data available for them, making it easier for analysts and scientists to overlook their importance. The same thing can be said for perpetrators although their bias comes from lack of information since the majority of them do not have an assigned age group. In addition, I noticed that the majority of crimes were committed by males, which can easily make any future models biased against males and more lenient towards females. But when it comes to crimes reported, most of those are also from males, making females an incredibly underrepresented group. This could lead to all sorts of problems for models since the odds are overwhelmingly in favor of one gender over another. It’s fair to say that most of these biases are from gaps in data and other unknown factors. In many cases, we cannot go back in time to retrieve any lost data so it’s important to look at what’s available with impartiality to ensure the best possible outcome.

results